Multilingual Jailbreak Attacks
Overview
The objective of the Multilingual Jailbreaking Evaluation is to assess model vulnerabilities with regards to jailbreaking attacks in a particular language. This test is designed with input from native speakers hired as annotators, and includes language-specific insights across 14 different safety categories.
Similar to Static and Adaptive Jailbreak Attacks, the Multilingual Jailbreak Attack tests whether a model outputs harmful content that violates general ethical, legal, or safety guidelines when prompted with a malicious user request in the specified non-English language. In particular, it assesses the model's responses to culturally and linguistically relevant harmful prompts, testing its ability to maintain ethical boundaries and adhere to safety protocols within the specific language context. This approach allows for a comprehensive examination of the model's robustness against language-specific vulnerabilities and potential cultural biases in its safety mechanisms.
The Multilingual Jailbreak Attack evaluation leverages DynamoEval’s Static and Dynamic Jailbreak attacks against the following set of jailbreaking attack techniques inspired by latest research.
-
Baseline: We run a baseline test with Dynamo AI Harmful dataset without any augmentations. This test aims to evaluate the model’s safety filters when exposed to vanilla harmful prompts. No jailbreaking, persuasion, or obfuscation techniques are used.
-
Persuasive Adversarial Prompts (PAP): PAP leverages persuasive techniques grounded in social science research to manipulate LLMs into violating predetermined rules. This approach is utilized to assess the robustness of LLMs against sophisticated forms of prompt injection that can subtly coerce the LLM into unwanted behaviors. By integrating social psychology principles into prompt design, PAP tests the susceptibility of LLMs to nuanced manipulations that aim to bypass safeguards and exploit vulnerabilities in the LLM's decision-making processes.
-
Tree of Attacks with Pruning (TAP): See Adaptive Jailbreak Attacks
-
Iterative Refinement Induced Self-Jailbreak (IRIS): See Adaptive Jailbreak Attacks
Metrics
Attack success rate (ASR) refers to the percentage of attempted jailbreaks that are successfully executed by an attacker. The higher the ASR, the more successful the attack and the more vulnerable the model is to jailbreaking attacks.
Supported Languages
- Japanese
- English (We support English separately with the Static and Adaptive Jailbreak Attacks)
- French (Coming soon)
- Spanish (Coming soon)
- German (Coming soon)
- Mandarin (Coming soon)
- Turkish (Coming soon)
- Thai (Coming soon)